Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


$$Y_g \sim \mathrm{Poi}(\lambda_g) \tag{5.12}$$

where $\lambda_g$ is the Poisson parameter, which represents the rate of change in the count of gene $g$ in a sample. The Poisson distribution assumes that the rate of change is equal to the mean, $\lambda_g$, which is also equal to the variance:

$$E(Y_g) = \mathrm{var}(Y_g) = \lambda_g \tag{5.13}$$

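The probability that $Y_g$ takes the value $y$ is

$$p(Y_g = y;\, \lambda_g) = \frac{\lambda_g^{y}\, e^{-\lambda_g}}{y!}, \qquad y = 0, 1, 2, \ldots \tag{5.14}$$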
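As a quick numerical check of equations 5.13 and 5.14, the sketch below evaluates the standard Poisson mass function $\lambda^{y} e^{-\lambda} / y!$ and recovers the mean and variance by summing over the support. The function names, the rate `lam = 4.0`, and the truncation point `y_max` are illustrative assumptions, not values from the text.

```python
import math

def poisson_pmf(y, lam):
    """P(Y = y) for a Poisson variable with rate lam > 0."""
    if y < 0:
        return 0.0
    # Work on the log scale so large counts do not overflow the factorial.
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))

def poisson_moments(lam, y_max=200):
    """Mean and variance summed numerically from the pmf, truncated at y_max."""
    probs = [poisson_pmf(y, lam) for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

lam = 4.0  # hypothetical expected count for one gene
mean, var = poisson_moments(lam)
print(f"P(Y=2) = {poisson_pmf(2, lam):.4f}")
print(f"mean = {mean:.4f}, variance = {var:.4f}")
```

For a rate this small the truncation at `y_max = 200` discards a negligible tail, so the computed mean and variance both agree with `lam` to numerical precision.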

Modeling RNA-Seq count data with the Poisson distribution requires that the mean be equal to the variance. A key challenge is the small number of replicates in typical RNA-Seq experiments (two or three replicates per condition). Inferential methods that treat each gene separately may therefore lack power, because the within-group variance estimates are highly uncertain. This challenge can be overcome either by grouping the count data and calculating the variance and the mean in each group, or by pooling information across genes on the assumption that different genes measured in the same experiment have similar variances. In general, RNA-Seq count data are over-dispersed: the variance is greater than the mean. A variety of software packages use different techniques for modeling RNA-Seq count data, but most of them use the quasi-Poisson, negative binomial, or quasi-negative binomial distribution, all of which accommodate over-dispersed data.
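One simple way to see over-dispersion in practice is to compare the sample variance with the sample mean for each gene. The sketch below does this for a small table of made-up counts; the gene names and values are hypothetical and chosen only for illustration. A ratio near 1 is what the Poisson model expects, while ratios well above 1 point to over-dispersion. With only three replicates per gene, each ratio is itself very noisy, which is the small-sample problem described above.

```python
from statistics import mean, variance

# Hypothetical raw counts: three replicates of one condition for four genes.
counts = {
    "geneA": [12, 15, 9],
    "geneB": [230, 310, 180],
    "geneC": [5, 4, 6],
    "geneD": [1020, 760, 1340],
}

def dispersion_index(replicates):
    """Sample variance divided by sample mean; about 1 under a Poisson model."""
    m = mean(replicates)
    return variance(replicates) / m if m > 0 else float("nan")

per_gene = {gene: dispersion_index(y) for gene, y in counts.items()}
for gene, ratio in per_gene.items():
    print(f"{gene}: mean={mean(counts[gene]):.1f} var/mean={ratio:.2f}")
```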

The quasi-Poisson model is similar to the Poisson distribution, but the variance is a linear function of the mean of the counts [31].

$$Y_g \sim \mathrm{Poi}(\mu_g, \theta) \tag{5.15}$$

$$\mathrm{var}(Y_g) = \theta \mu_g \tag{5.16}$$

where $\theta$ is the dispersion parameter and $\mu_g$ is the mean count.
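Under the quasi-Poisson variance in equation 5.16, a shared dispersion can be estimated by pooling across genes. The sketch below uses the usual Pearson estimator (the sum of squared Pearson residuals divided by the residual degrees of freedom), assuming a single $\theta$ common to all genes and one fitted mean per gene. The function name and the count table are hypothetical, and this is not presented as the procedure of any particular package.

```python
def quasi_poisson_dispersion(count_table):
    """Pearson estimate of a shared theta for var(Y) = theta * mu.

    Each row holds the replicate counts of one gene. The row mean estimates
    mu, so one degree of freedom is spent per gene.
    """
    pearson, dof = 0.0, 0
    for replicates in count_table:
        mu = sum(replicates) / len(replicates)
        if mu == 0:
            continue  # an all-zero gene carries no information about theta
        pearson += sum((y - mu) ** 2 / mu for y in replicates)
        dof += len(replicates) - 1
    if dof == 0:
        raise ValueError("need at least one gene with two or more replicates")
    return pearson / dof

table = [[12, 15, 9], [230, 310, 180], [5, 4, 6]]  # hypothetical counts
theta = quasi_poisson_dispersion(table)
print(f"estimated theta = {theta:.3f}")
```

An estimate close to 1 would be consistent with plain Poisson variation; values well above 1 indicate over-dispersion under this model.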

In the negative binomial distribution, the variance is a function of the mean:

$$Y_g \sim \mathrm{NB}(\mu_g, \alpha) \tag{5.17}$$

$$\mathrm{var}(Y_g) = \mu_g + \alpha \mu_g^{P} \tag{5.18}$$

where $\alpha$ is the dispersion parameter and $P$ is an integer; $P = 2$ is the common choice (the NB2 or quadratic model).
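To make the NB2 mean-variance relationship concrete, the sketch below assumes the variance takes the form $\mu + \alpha \mu^{P}$ and solves it for $\alpha$ with the sample mean and sample variance plugged in (a method-of-moments estimate). The function names and replicate counts are hypothetical. With two or three replicates such a per-gene estimate is very unstable, so it should be read as an illustration of the formula rather than a recommended estimator.

```python
def nb_variance(mu, alpha, p=2):
    """Variance of the form mu + alpha * mu**p (p = 2 is the NB2 model)."""
    return mu + alpha * mu ** p

def alpha_moment_estimate(replicates, p=2):
    """Method-of-moments alpha: solve s^2 = m + alpha * m**p for alpha.

    Assumes at least two replicates and a positive sample mean.
    """
    n = len(replicates)
    m = sum(replicates) / n
    s2 = sum((y - m) ** 2 for y in replicates) / (n - 1)
    # Truncate at zero: a sample variance below the mean gives no
    # evidence of over-dispersion.
    return max((s2 - m) / m ** p, 0.0)

y = [230, 310, 180]  # hypothetical replicate counts for one gene
alpha = alpha_moment_estimate(y)
print(f"alpha = {alpha:.4f}")
print(f"implied variance at the sample mean = {nb_variance(240, alpha):.1f}")
```

Setting `alpha` to 0 reduces the variance to `mu`, which is the Poisson case.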